Transformer-based models have been widely demonstrated to be successful in computer vision tasks by modelling long-range dependencies and capturing global representations. However, they are often dominated by features of large patterns, leading to the loss of local details (e.g., boundaries and small objects), which are critical in medical image segmentation. To alleviate this problem, we propose a Dual-Aggregation Transformer Network called DuAT, which is characterized by two innovative designs, namely the Global-to-Local Spatial Aggregation (GLSA) and Selective Boundary Aggregation (SBA) modules. The GLSA module aggregates and represents both global and local spatial features, which are beneficial for locating large and small objects, respectively. The SBA module aggregates boundary characteristics from low-level features and semantic information from high-level features to better preserve boundary details and locate the re-calibrated objects. Extensive experiments on six benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods in the segmentation of skin lesion images and polyps in colonoscopy images. In addition, our approach is more robust than existing methods in various challenging situations such as small object segmentation and ambiguous object boundaries.
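For readers who want a concrete picture, a minimal PyTorch sketch of a dual-aggregation block is given below: a global branch re-weights channels using globally pooled context while a local branch applies a depthwise convolution, and the two are fused. The branch layout, layer sizes, and the class name DualAggregationBlock are assumptions for illustration, not the published GLSA/SBA definitions.

```python
# Hypothetical sketch of a global-to-local spatial aggregation block (PyTorch).
# The branch layout and sizes are assumptions, not the published GLSA design.
import torch
import torch.nn as nn

class DualAggregationBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Global branch: squeeze spatial dims, re-weight channels with global context.
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Local branch: depthwise convolution keeps fine boundary detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        g = x * self.global_gate(x)   # globally re-weighted features
        l = self.local_branch(x)      # locally refined features
        return self.fuse(torch.cat([g, l], dim=1))
```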
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
Recent efforts on multimodal transformers have improved visually rich document understanding (VRDU) tasks by incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, which makes it difficult for them to learn from coarse-grained elements, including natural lexical units such as phrases and salient visual regions (e.g., prominent image regions). In this paper, we attach more importance to coarse-grained elements, which contain high-density information and consistent semantics and are valuable for document understanding. First, a document graph is proposed to model the complex relationships among multi-level multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multimodal transformer called MMLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal transformers based on the graph. In MMLayout, coarse-grained information is aggregated from fine-grained information and, after further processing, fused back into the fine-grained information for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analysis shows that our method can capture consistent semantics in coarse-grained elements.
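As a rough illustration of the coarse-from-fine aggregation described above, the following sketch mean-pools fine-grained token features into coarse elements given a cluster assignment and adds the coarse context back to each token. The function name and the residual fusion are assumptions; the paper fuses the two granularities with a transformer rather than a simple addition.

```python
# Minimal sketch of coarse-from-fine aggregation and fuse-back (assumed mechanics,
# not the published MMLayout operators). `assign[i]` maps fine token i to a
# coarse element (phrase or salient region) obtained e.g. by clustering.
import torch

def aggregate_and_fuse(fine: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    # fine: (N, D) fine-grained token features; assign: (N,) coarse element ids.
    num_coarse = int(assign.max()) + 1
    coarse = torch.zeros(num_coarse, fine.size(1), dtype=fine.dtype, device=fine.device)
    counts = torch.zeros(num_coarse, dtype=fine.dtype, device=fine.device)
    coarse.index_add_(0, assign, fine)                      # sum fine features per coarse id
    counts.index_add_(0, assign, torch.ones_like(assign, dtype=fine.dtype))
    coarse = coarse / counts.clamp(min=1).unsqueeze(1)      # mean pooling per coarse element
    # Fuse the coarse context back into each fine token (simple residual addition here).
    return fine + coarse[assign]
```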
In recent years, neural networks have been expanding rapidly with novel strategies and applications. However, challenges in neural network techniques remain unresolved, even though they will inevitably have to be addressed for critical applications. Attempts have been made to overcome these challenges by representing and embedding domain knowledge in symbolic forms. This has given rise to the concept of neuro-symbolic learning (NeSyL), which incorporates aspects of symbolic representation and brings common sense into neural networks. NeSyL has shown promising results in domains where interpretability, reasoning, and explainability are crucial, such as video and image captioning, question answering and reasoning, health informatics, and genomics. This review presents a comprehensive survey of state-of-the-art NeSyL approaches, their principles, advances in machine and deep learning algorithms, applications such as ophthalmology, and, most importantly, the future perspectives of this emerging field.
Single image deraining (SID) in real scenarios has attracted increasing attention in recent years. Because real-world rainy/clean image pairs are difficult to obtain, previous real datasets suffer from low-resolution images, homogeneous rain streaks, limited background variation, and even misalignment of image pairs, resulting in unreliable evaluation of SID methods. To address these issues, we establish a new high-quality dataset named RealRain-1K, consisting of 1,120 high-resolution paired clean and rainy images with low- and high-density rain streaks, respectively. The images in RealRain-1K are automatically generated and aligned from a large number of real-world rain video clips through a simple yet effective rain-density-controllable filtering method. RealRain-1K also provides abundant rain streak layers as a byproduct, enabling us to build a large-scale synthetic dataset named SynRain-13K by pasting the rain streak layers onto abundant natural images. Based on these and existing datasets, we benchmark 10 representative SID methods on three tracks: (1) fully supervised learning on RealRain-1K, (2) domain generalization to real datasets, and (3) syn-to-real transfer learning. The experimental results (1) show the differences among representative methods in image restoration performance and model complexity, (2) validate the significance of the proposed dataset for model generalization, and (3) provide useful insights into the superiority of learning from diverse domains and shed light on future research on real-world SID. The dataset will be released at https://github.com/hiker-lw/realrain-1k.
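The abstract states that SynRain-13K is built by pasting extracted rain streak layers onto natural images. A minimal compositing sketch under an additive-blending assumption (the authors' exact blending rule may differ) could look like:

```python
# Hypothetical additive compositing of a rain streak layer onto a clean image.
# The actual blending rule used to build SynRain-13K may differ.
import numpy as np

def paste_rain(clean: np.ndarray, rain_layer: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """clean, rain_layer: float arrays in [0, 1] with the same HxWxC shape."""
    rainy = clean + alpha * rain_layer   # rain streaks brighten the scene
    return np.clip(rainy, 0.0, 1.0)
```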
Colonoscopy, currently the most efficient and recognized colon polyp detection technology, is necessary for early screening and prevention of colorectal cancer. However, due to the varying sizes and complex morphological features of colonic polyps, as well as the indistinct boundary between polyps and mucosa, accurate segmentation of polyps remains challenging. Deep learning has become popular for accurate polyp segmentation tasks with excellent results. However, due to the structure of polyp images and the varying shapes of polyps, existing deep learning models easily overfit the current dataset and, as a result, may fail to generalize to unseen colonoscopy data. To address this, we propose a new state-of-the-art model for medical image segmentation, the SSFormer, which uses a pyramid Transformer encoder to improve the generalization ability of models. Specifically, our proposed Progressive Locality Decoder can be adapted to the pyramid Transformer backbone to emphasize local features and restrict attention dispersion. The SSFormer achieves state-of-the-art performance in both learning and generalization assessments.
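The following is a hedged sketch of what a progressive, locality-emphasizing decoder over pyramid Transformer features might look like: each coarse stage is upsampled, fused with the next finer stage, and refined by a 3x3 convolution. The class name, channel widths, and fusion rule are assumptions rather than the exact SSFormer design.

```python
# Rough sketch of progressively fusing pyramid encoder features with a local
# (convolutional) emphasis step at each stage; not the exact SSFormer decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveLocalDecoder(nn.Module):
    def __init__(self, in_channels: list, dim: int = 64, num_classes: int = 1):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        # 3x3 convs emphasize local structure and limit attention dispersion.
        self.local = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in in_channels])
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):
        # feats: coarse-to-fine pyramid features, e.g. strides 32, 16, 8, 4.
        x = self.local[0](self.proj[0](feats[0]))
        for proj, local, f in zip(self.proj[1:], self.local[1:], feats[1:]):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
            x = local(x + proj(f))   # fuse upsampled coarse features with the finer stage
        return self.head(x)
```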
We present a novel deep learning framework named the Iteratively Optimized Patch Label Inference Network (IOPLIN) for automatically detecting various pavement distresses that are not limited to specific ones, such as cracks and potholes. IOPLIN can be iteratively trained via an Expectation-Maximization Inspired Patch Label Distillation (EMIPLD) strategy and accomplishes this task well by inferring the labels of patches from pavement images. IOPLIN enjoys many desirable properties over state-of-the-art single-branch CNN models such as GoogLeNet and EfficientNet. It is able to handle images in different resolutions and sufficiently utilize image information, particularly for high-resolution images, since IOPLIN extracts visual features from unrevised image patches instead of the resized whole image. Moreover, it can roughly localize pavement distress without using any prior localization information in the training phase. To better evaluate the effectiveness of our method in practice, we construct a large-scale bituminous pavement distress detection dataset named CQU-BPDD, which consists of 60,059 high-resolution pavement images acquired from different areas at different times. Extensive results on this dataset demonstrate the superiority of IOPLIN over state-of-the-art image classification approaches in automatic pavement distress detection. The source code of IOPLIN is released at https://github.com/DearCaat/ioplin, and the CQU-BPDD dataset can be accessed at https://dearcaat.github.io/CQU-BPDD/.
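To make the iterative training idea concrete, the sketch below alternates between fitting a patch classifier on pseudo patch labels (initialized from image-level labels) and re-inferring patch labels from its predictions. The helper callables fit_patch_classifier and predict_patch_probs, the threshold, and the update rule are hypothetical simplifications of the EMIPLD strategy, not the released implementation.

```python
# Simplified EM-style patch label distillation loop (assumed details; the helper
# callables passed in are hypothetical stand-ins, not the released EMIPLD code).
def distill_patch_labels(images, image_labels, fit_patch_classifier,
                         predict_patch_probs, epochs=10, threshold=0.5):
    # Initialize every patch with its image-level label.
    patch_labels = [[y] * len(predict_patch_probs(img))
                    for img, y in zip(images, image_labels)]
    for _ in range(epochs):
        fit_patch_classifier(images, patch_labels)       # M-step: refit on pseudo labels
        for i, (img, y) in enumerate(zip(images, image_labels)):
            probs = predict_patch_probs(img)             # E-step: re-infer patch labels
            if y == 1:
                patch_labels[i] = [int(p > threshold) for p in probs]  # keep confident distress
            else:
                patch_labels[i] = [0] * len(probs)       # negative images have no distress
    return patch_labels
```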
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
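As a sketch of the encoder-decoder split described above, the lightweight decoder below reshapes plain ViT patch tokens into a feature map and upsamples it with two deconvolution layers into keypoint heatmaps. The layer sizes and class name are assumptions; consult the ViTPose code for the actual head.

```python
# Minimal sketch of a ViTPose-style head: a lightweight decoder that upsamples
# plain ViT patch features into keypoint heatmaps. Layer sizes are assumptions.
import torch
import torch.nn as nn

class SimpleKeypointDecoder(nn.Module):
    def __init__(self, embed_dim: int = 768, num_keypoints: int = 17):
        super().__init__()
        self.deconv = nn.Sequential(   # two 2x upsampling steps
            nn.ConvTranspose2d(embed_dim, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(256, num_keypoints, 1)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, C) patch tokens from the plain ViT encoder.
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        return self.head(self.deconv(x))   # (B, K, 4h, 4w) keypoint heatmaps
```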
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study could draw more attention from the community on developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging the off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
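Since the modification is described concretely (a stride-4 patch embedding and pooling between block groups that take the features from 1/4 to 1/32 resolution), a minimal PyTorch sketch under those assumptions follows; the block counts, channel width, and the choice of average pooling are illustrative, and the real model may also change channel width across stages.

```python
# Sketch of the described change: stride-4 patch embedding, groups of transformer
# blocks with 2x average pooling in between, reducing features from 1/4 to 1/32.
# Block counts and dimensions are assumptions, not the released configuration.
import torch
import torch.nn as nn

class HierarchicalViTSketch(nn.Module):
    def __init__(self, dim=96, depths=(2, 2, 6, 2), heads=3):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)   # stride 4 instead of 16
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True),
                num_layers=d)
            for d in depths
        ])
        self.pool = nn.AvgPool2d(2)   # simple average pooling halves the resolution

    def forward(self, x):
        x = self.embed(x)             # (B, C, H/4, W/4)
        for i, stage in enumerate(self.stages):
            b, c, h, w = x.shape
            x = stage(x.flatten(2).transpose(1, 2))        # run blocks on tokens
            x = x.transpose(1, 2).reshape(b, c, h, w)
            if i < len(self.stages) - 1:
                x = self.pool(x)      # 1/4 -> 1/8 -> 1/16 -> 1/32 across stages
        return x
```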
With the growing demand for employing a team of robots to perform a task collaboratively, the research community has become increasingly interested in collaborative simultaneous localization and mapping. Unfortunately, existing datasets are limited in the scale and variation of the collaborative trajectories, even though generalization between inter-trajectories among different agents is crucial to the overall viability of collaborative tasks. To help align the research community's contributions with realistic multi-agent coordinated SLAM problems, we propose S3E, a large-scale multimodal dataset captured by a fleet of unmanned ground vehicles along four designed collaborative trajectory paradigms. S3E consists of 7 outdoor and 5 indoor sequences that each exceed 200 seconds, consisting of well temporally synchronized and spatially calibrated high-frequency IMU, high-quality stereo camera, and 360-degree LiDAR data. Crucially, our effort exceeds previous attempts regarding dataset size, scene variability, and complexity. It has 4x as much average recording time as the pioneering EuRoC dataset. We also provide careful dataset analysis as well as baselines for collaborative SLAM and its single-agent counterpart. Data and more up-to-date details can be found at https://github.com/PengYu-Team/S3E.